ABSTRACT

This paper presents an architecture and a synthesis method for programmable numerical function generators of trigonometric functions, logarithm functions, square root, reciprocal, etc. Our architecture uses an LUT (Look-Up Table) cascade as the segment index encoder, compactly realizes various numerical functions, and is suitable for automatic synthesis. We have developed a synthesis system that converts a MATLAB-like specification into HDL code. We propose and compare three architectures implemented on an FPGA (Field-Programmable Gate Array). Experimental results show the efficiency of our architecture and synthesis system.

1.  INTRODUCTION 

Numerical functions, such as trigonometric functions, logarithm, square root, reciprocal, etc., are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, etc. High-level programming languages, such as C and FORTRAN, usually have software libraries for standard numerical functions. However, for high-speed applications, a hardware implementation is needed. Hardware implementation by a single look-up table for a numerical function $f(x)$ is simple and fast. For low-precision computations of $f(x)$, i.e., when $x$ has a small number of bits, this implementation is straightforward. For high-precision computations, however, the single look-up table implementation is impractical due to the huge table size. For such applications, the CORDIC (Coordinate Rotation Digital Computer) algorithm [1, 16] has been used. Although faster than software approaches, it is iterative and therefore slow.

This paper proposes an architecture and a synthesis method for NFGs (Numerical Function Generators) using linear approximations. By using the LUT cascade [7, 11], our architecture can realize various numerical functions quickly, and is suitable for automatic synthesis. Fig. 1 shows the synthesis flow for the NFG. It generates HDL (Hardware Description Language) code from a design specification written in Scilab [14], a MATLAB-like numerical calculation software. The design specification includes a numerical function $f(x)$, a domain $[a, b]$ for $x$, and an acceptable error. The system first partitions the given domain for $x$ into segments, and then approximates the given function $f(x)$ by a linear function for each segment. Next, it analyzes the error, and derives the necessary precision for the computing units in the NFG. Then, it generates HDL code that is mapped onto an FPGA using an FPGA vendor tool.

This paper is organized as follows. Section 2 introduces terminology. Section 3 proposes a linear approximation algorithm for numerical functions. Section 4 shows three different architectures for NFGs. Section 5 describes the FPGA implementation method. Section 6 evaluates the performance of our architecture and synthesis system. Due to the page limitation, the error analysis for our NFGs is omitted. It is available in [13]. This paper builds on [12].

2.  PRELIMINARIES 

Definition 2.1 The binary fixed-point representation of a value $r$ has the form

$$d_{n\_int-1}\, d_{n\_int-2} \cdots d_1\, d_0 \,.\, d_{-1}\, d_{-2} \cdots d_{-n\_frac} \qquad (1)$$

where $d_i \in \{0, 1\}$, $n\_int$ is the number of bits for the integer part, and $n\_frac$ is the number of bits for the fractional part of $r$. The representation in (1) is two's complement, and so

$$r = -2^{n\_int-1}\, d_{n\_int-1} + \sum_{i=-n\_frac}^{n\_int-2} 2^{i} d_i .$$
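As a small illustration of Definition 2.1, the following Python sketch decodes a bit string into the value $r$. The function name `fixed_point_value` and the string encoding of the bits are assumptions made only for this example; they do not come from the synthesis system.

```python
def fixed_point_value(bits: str, n_int: int, n_frac: int) -> float:
    """Value of a two's-complement fixed-point word per Definition 2.1.

    `bits` lists d_{n_int-1} ... d_0 d_{-1} ... d_{-n_frac}, MSB first.
    (Hypothetical helper, for illustration only.)
    """
    if not bits or len(bits) != n_int + n_frac or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty 0/1 string of length n_int + n_frac")
    digits = [int(b) for b in bits]
    # The most significant bit carries the negative weight -2^(n_int-1).
    value = -(2.0 ** (n_int - 1)) * digits[0]
    # The remaining bits carry weights 2^(n_int-2) down to 2^(-n_frac).
    for offset, d in enumerate(digits[1:], start=1):
        value += d * 2.0 ** (n_int - 1 - offset)
    return value
```

For example, with `n_int = 2` and `n_frac = 2`, the word `0110` decodes to 1.5 and the word `1111` decodes to -0.25.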

Definition 2.2 Error is the absolute difference between the original value and the approximated value. Approximation error is the error caused by a function approximation, and rounding error is the error caused by a binary fixed-point representation. Acceptable error is the maximum error that an NFG may assume. Acceptable approximation error is the maximum approximation error that a function approximation may assume.

Definition 2.3 Precision is the total number of bits for a binary fixed-point representation. Specifically, $n$-bit precision specifies that $n$ bits are used to represent the number; that is, $n\_int + n\_frac = n$. An $n$-bit precision NFG has an $n$-bit input.

Definition 2.4 Accuracy is the number of bits in the fractional part of a binary fixed-point representation. Specifically, $m$-bit accuracy specifies that $m$ bits are used to represent the fractional part of the number; that is, $n\_frac = m$. An $m$-bit accuracy NFG is an NFG with an $m$-bit fractional part of the input, an $m$-bit fractional part of the output, and a $2^{-m}$ acceptable error.
Fig. 1. Synthesis flow for NFGs.


3.  LINEAR  APPROXIMATION  ALGORITHM 

For functions that are approximately linear, such as $\sin(x)$ ($0 \le x \le \pi/2$), the linear approximation method yields small approximation error with relatively few segments. Indeed, in such cases uniformly wide segments yield good approximations. Uniform segments have been used in previous studies [2, 5, 15] to simplify the circuits. However, for some kinds of numerical functions such as $\sqrt{-\ln(x)}$, uniform segmentation requires too many segments. To approximate such functions using fewer segments, a method that partitions the domain into non-uniform segments was proposed in [8]. Unfortunately, that segmentation is fixed; it is not optimized for the given function. We improve on this by adapting the segmentation to the function so that relatively few segments are needed. This reduces the memory required.


3.1.  Segmentation  Algorithm 

Fig. 2 presents the segmentation algorithm, where the inputs are a numerical function $f(x)$, a domain $[a, b]$ for $x$, and an acceptable approximation error $\varepsilon$. The algorithm begins by forming one segment over the whole domain $[a, b]$. This is an initial approximation by a linear function whose endpoints are $(a, f(a))$ and $(b, f(b))$. If the current segment fails to provide the acceptable approximation error, it is partitioned into two segments joined at a point $p$ where the maximum error occurs. This process is repeated on the resulting subsegments until every segment approximates $f(x)$ to within the acceptable approximation error. The correction values $v_i$ are used to reduce the approximation errors. In Fig. 2, $max_{fg}$ and $min_{fg}$ denote the maximum positive error and the maximum negative error, respectively. These errors are equalized by a vertical shift of the linear function $g(x)$ by $v_i$. In Fig. 2, $p_{max}$ and $p_{min}$ can be found by scanning values of $x$ over $[s, e]$. However, this is time-consuming. We use a nonlinear programming algorithm [6] to find these values efficiently.

The algorithm is based on the Douglas-Peucker algorithm [4], which is used in rendering curves for graphics displays.
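The steps of Fig. 2 can be sketched in Python as follows. This is a minimal illustration, not the synthesis system's implementation: it assumes that $p_{max}$ and $p_{min}$ are located by scanning a fixed number of sample points per segment (the slow alternative mentioned above) rather than by the nonlinear programming method of [6], and it replaces the recursion by an explicit stack. The names `segment_domain`, `samples`, and `Segment` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Segment = Tuple[float, float, float]  # (s, e, v): endpoints and correction value


def segment_domain(f: Callable[[float], float], a: float, b: float,
                   eps: float, samples: int = 257) -> List[Segment]:
    """Non-uniform segmentation following the steps of Fig. 2.

    Assumption: p_max and p_min are found by scanning `samples` evenly
    spaced points in each segment, not by nonlinear programming [6].
    """
    done: List[Segment] = []
    stack = [(a, b)]
    while stack:
        s, e = stack.pop()
        c1 = (f(e) - f(s)) / (e - s)              # step 1: chord through endpoints
        c0 = f(s) - c1 * s
        xs = [s + (e - s) * k / (samples - 1) for k in range(samples)]
        diffs = [f(x) - (c1 * x + c0) for x in xs]
        max_fg, min_fg = max(diffs), min(diffs)   # steps 2 and 3
        p_max = xs[diffs.index(max_fg)]
        p_min = xs[diffs.index(min_fg)]
        p = p_max if abs(max_fg) > abs(min_fg) else p_min   # step 4
        error = abs(max_fg - min_fg) / 2          # step 5
        v = (max_fg + min_fg) / 2
        if error <= eps or not (s < p < e):       # step 6 (guard against no progress)
            done.append((s, e, v))
        else:                                     # step 7: split at p
            stack.append((p, e))
            stack.append((s, p))
    return sorted(done)


if __name__ == "__main__":
    segs = segment_domain(math.sin, 0.0, math.pi / 2, 2.0 ** -9)
    print(len(segs), "segments")
```

Because the error is only measured at the sample points, the true maximum error of a segment can exceed `eps` by a small amount that shrinks as `samples` grows. Functions with a singular endpoint are not handled by this sketch.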


3.2.  Computation  of  Approximated  Values 

A segment $[s_i, e_i]$ is denoted by $seg_i$; thus, the segments generated by the segmentation algorithm are denoted by $seg_0, seg_1, \ldots, seg_{t-1}$. For each $seg_i$, the numerical function $f(x)$ is approximated by the corresponding linear function $g_i(x)$. Therefore, the approximated value $y$ of $f(x)$ is computed as follows:

$$y = g_i(x) = c_{1i}\, x + c_{0i}, \qquad (2)$$

where $g_i(x)$ is the linear function for the segment $seg_i$,

$$c_{1i} = \frac{f(e_i) - f(s_i)}{e_i - s_i}, \quad \text{and} \quad c_{0i} = f(s_i) - c_{1i}\, s_i + v_i .$$

By substituting $c_{0i}$ into Equation (2), and simplifying it, we have

$$g_i(x) = c_{1i}(x - s_i) + f(s_i) + v_i. \qquad (3)$$

Let $h = e_i - s_i$ and $h \to 0$. Then, we have $c_{1i} = f'(s_i)$. By substituting this equation into Equation (3), we have $g_i(x) = f'(s_i)(x - s_i) + f(s_i) + v_i$. This is the first-order Taylor expansion around $x = s_i$ for $f(x)$ with the correction value $v_i$. Our algorithm can approximate $f(x)$ with any acceptable approximation error using sufficiently many segments.
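To make the relation between Equations (2) and (3) concrete, the short Python sketch below computes the coefficients of one segment and evaluates both forms. The helper names (`segment_coefficients`, `eval_eq2`, `eval_eq3`) and the sample values are assumptions chosen for illustration.

```python
import math
from typing import Callable, Tuple


def segment_coefficients(f: Callable[[float], float], s: float, e: float,
                         v: float) -> Tuple[float, float]:
    """Return (c1, c0) of Equation (2) for the segment [s, e] with correction v."""
    c1 = (f(e) - f(s)) / (e - s)
    c0 = f(s) - c1 * s + v
    return c1, c0


def eval_eq2(c1: float, c0: float, x: float) -> float:
    """Equation (2): y = c1 * x + c0."""
    return c1 * x + c0


def eval_eq3(f: Callable[[float], float], s: float, c1: float,
             v: float, x: float) -> float:
    """Equation (3): y = c1 * (x - s) + f(s) + v."""
    return c1 * (x - s) + f(s) + v


if __name__ == "__main__":
    # Hypothetical segment of sqrt(x) on [0.25, 1.0] with correction 0.01.
    c1, c0 = segment_coefficients(math.sqrt, 0.25, 1.0, 0.01)
    print(eval_eq2(c1, c0, 0.5), eval_eq3(math.sqrt, 0.25, c1, 0.01, 0.5))
```

Both forms return the same value up to floating-point rounding. They differ only in which operand reaches the multiplier: $x$ in Equation (2), and the offset $x - s_i$ in Equation (3).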

4.  ARCHITECTURE  FOR  NFGS 

4.1.  Overview 

Although Equation (2) and Equation (3) represent the same values, the architectures for the NFGs realizing them are different. Fig. 3 (a) shows the architecture for Equation (2); it uses four units: the segment index encoder that computes the index $i$ of the segment $seg_i$ containing the value $x$; the coefficients table for $c_{1i}$ and $c_{0i}$; the multiplier; and the adder. On the other hand, Fig. 3 (b) shows the architecture for Equation (3); it uses five units: the four units used in Fig. 3 (a), where $-s_i$ is stored in the coefficients table, and an additional adder for the computation of $x + (-s_i)$. In Equation (3), when $s_i = (\text{the most significant } (n-k) \text{ bits of } x) \times 2^k$, the index $i$ of the segment $seg_i$ is equal to the most significant $(n-k)$ bits, and $(x - s_i)$ is equal to the least significant $k$ bits of $x$, where $x$ has $n$-bit precision. Therefore, in this case, the linear approximations are realized using only three units as shown in Fig. 3 (c): the coefficients table for $c_{1i}$ and $f(s_i) + v_i$; the multiplier; and the adder. Note that this architecture realizes a uniform segmentation.

We use the architecture shown in Fig. 3 (b) to produce fast and compact NFGs. In Section 6, we will compare the performance of the three architectures.
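The uniform case of Fig. 3 (c) amounts to splitting the input word into two bit fields. The following Python sketch shows this under the assumption that $x$ is handled as an unsigned raw $n$-bit integer; the function name `split_uniform` is hypothetical.

```python
from typing import Tuple


def split_uniform(x: int, n: int, k: int) -> Tuple[int, int]:
    """Split an n-bit unsigned word x into (segment index, offset).

    The index is the most significant (n - k) bits of x, and the offset
    x - s_i is the least significant k bits, with s_i = index * 2**k.
    """
    if not (0 <= k <= n) or not (0 <= x < (1 << n)):
        raise ValueError("require 0 <= k <= n and 0 <= x < 2**n")
    index = x >> k
    offset = x & ((1 << k) - 1)
    return index, offset
```

For instance, `split_uniform(0b10110110, 8, 4)` returns `(0b1011, 0b0110)`, so no segment index encoder and no subtraction are needed in this case.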

4.2.  Segment  Index  Encoder 

A segment index encoder converts an input value $x$ into a segment index $i$ for $seg_i$. It realizes the segment index function $seg\_func(x) : B^n \to \{0, 1, \ldots, t-1\}$ shown in Fig. 4 (a), where $x$ has $n$-bit precision, $B = \{0, 1\}$, and $t$ denotes the number of segments. In [8], to simplify the segment index encoder, the values of $s_i$ and $e_i$ are restricted. That is, a restrictive non-uniform segmentation is used for the segment index encoder. Such segmentation increases the number of segments and is unsuitable for automatic segmentation. Our synthesis system uses the LUT cascade [7, 11, 12]
Input: Numerical function $f(x)$, domain $[a, b]$ for $x$, acceptable approximation error $\varepsilon$.

Output: Segments $[s_0, e_0], [s_1, e_1], \ldots, [s_{t-1}, e_{t-1}]$; correction values $v_0, v_1, \ldots, v_{t-1}$.

Process: This is a recursive procedure. The initial segment is set to $[a, b]$.

1. For the given segment $[s, e]$, compute a line connecting the two points $(s, f(s))$ and $(e, f(e))$, represented by the linear function $g(x) = c_1 x + c_0$, where $c_1 = \frac{f(e) - f(s)}{e - s}$ and $c_0 = f(s) - c_1 s$.

2. Find a value $p_{max}$ of the variable $x$ that maximizes $f(x) - g(x)$ in $[s, e]$, and let $max_{fg} = f(p_{max}) - g(p_{max})$, where $max_{fg} \ge 0$.

3. Similarly, find a $p_{min}$ that minimizes $f(x) - g(x)$, and let $min_{fg} = f(p_{min}) - g(p_{min})$, where $min_{fg} \le 0$.

4. Let $p = p_{max}$ if $|max_{fg}| > |min_{fg}|$, and let $p = p_{min}$ otherwise.

5. Let $error = |max_{fg} - min_{fg}|/2$, and $v = (max_{fg} + min_{fg})/2$.

6. If $error \le \varepsilon$, then declare $[s, e]$ to be a completed segment. If all segments are completed, stop.

7. For any segment $[s, e]$ that is not completed, partition $[s, e]$ into two segments $[s, p]$ and $[p, e]$, and iterate the same process for each new segment recursively.


Fig. 2. Segmentation algorithm for the domain.


(a) Architecture for $y = c_{1i}\, x + c_{0i}$ (non-uniform segmentation).
(b) Architecture for $y = c_{1i}(x - s_i) + f(s_i) + v_i$ (non-uniform segmentation).
(c) Architecture for $y = c_{1i}(x - s_i) + f(s_i) + v_i$ (uniform segmentation).

Fig. 3. Three architectures for NFGs.

shown in Fig. 4 (b) to realize any $seg\_func(x)$. It can be designed by functional decomposition using BDDs (Binary Decision Diagrams) representing $seg\_func(x)$. That is, our synthesis system uses non-restrictive segmentation. This is suitable for automatic synthesis. In LUT cascades, the interconnecting lines between adjacent LUTs are called rails. The size of an LUT cascade depends on the number of rails. Thus, to produce a compact LUT cascade, a small number of rails is sought. The next theorem shows that the segment index functions are realized by compact LUT cascades.


Interval                        | Index
--------------------------------+------
$s_0 \le x \le e_0$             | 0
$s_1 \le x \le e_1$             | 1
...                             | ...
$s_{t-1} \le x \le e_{t-1}$     | $t-1$

(a) Segment index function. (b) LUT cascade.

Fig. 4. Segment index encoder.


Theorem 4.1 [12] Let $seg\_func(x)$ be a segment index function with $t$ segments. Then, there exists an LUT cascade for $seg\_func(x)$ with at most $\lceil \log_2 t \rceil$ rails.

Our synthesis system uses heterogeneous MDDs (Multi-valued Decision Diagrams) [10] to find compact LUT cascades. Since the LUT cascade is suitable for pipeline processing, it offers a fast and compact circuit. Experimental results will show that NFGs using LUT cascades have sizes comparable to those using uniform segmentation for certain functions, like trigonometric functions, and much smaller sizes for other functions, like $\sqrt{x}$.
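A software reference model of the segment index function can be useful when checking a generated encoder. The Python sketch below is such a model, not the LUT cascade itself: it assumes that the segment start points are given as sorted raw integer words and that a point shared by two segments is assigned to the higher one. The names `seg_func` and `max_rails` are hypothetical; `max_rails` simply evaluates the bound $\lceil \log_2 t \rceil$ of Theorem 4.1.

```python
from bisect import bisect_right
from typing import Sequence


def seg_func(x: int, starts: Sequence[int]) -> int:
    """Index i of the segment containing x.

    `starts` holds the start points s_0 < s_1 < ... < s_{t-1} as raw
    n-bit integers; x must satisfy x >= s_0.
    """
    if x < starts[0]:
        raise ValueError("x lies below the first segment")
    return bisect_right(starts, x) - 1


def max_rails(t: int) -> int:
    """Upper bound ceil(log2 t) on the number of rails (Theorem 4.1)."""
    if t < 1:
        raise ValueError("t must be positive")
    return (t - 1).bit_length()
```

For example, `max_rails(127)` is 7 and `max_rails(702)` is 10, which are the bounds for the segment counts reported for $\sin(\pi x)$ and $1/x$ in Table 3.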


5.  IMPLEMENTATION  WITH  FPGAS 


Modern FPGAs consist of logic elements, synchronous memory blocks, multipliers (DSP units), etc. Our synthesis system efficiently generates NFGs using these components. Each unit for the NFG shown in Fig. 3 (b) is implemented by the following components in an FPGA: 1) Segment index encoder (LUT cascade) and coefficients table (ROM): by synchronous memory blocks; 2) Multiplier: by DSP units; and 3) Adder: by logic elements. Our synthesis system derives the optimum bit-width for each component by automatic error analysis [13].
Table 1. Number of pipeline stages for NFGs.

No. | Name of unit                                   | Pipeline stages
----+------------------------------------------------+-------------------------
1   | LUT cascade                                    | $n\_cas$
2   | Coefficients table                             | 1
3   | Adder for $x + (-s_i)$                         | 1
4   | Multiplier for $c_{1i}(x - s_i)$               | 1
5   | Shifter (optional)                             | 0 or 1
6   | Two's complementer (optional)                  | 0 or 1
7   | Adder for $c_{1i}(x - s_i) + c_{0i}$           | 1
    | Total pipeline stages                          | $n\_cas + 4$ to $n\_cas + 6$

$n\_cas$: Number of LUTs for the LUT cascade.


5.1.  Size  Reduction  of  Multiplier 

Although modern FPGAs have dedicated multipliers, large multipliers are slow. In our architecture, the multiplier often has the longest delay time among all the units. Thus, to generate a fast NFG, reducing the size of the multiplier is important. Since the size of the multiplier depends on the number of bits for $c_{1i}$, we reduce the number of bits for $c_{1i}$.

First, we consider the case where the absolute value of $c_{1i}$ is large. When $|c_{1i}|$ is large, many bits are required to represent $c_{1i}$ in binary fixed-point. To reduce the number of bits for such $c_{1i}$, we use the scaling method shown in [8]. When $|c_{1i}|$ is large, we represent $c_{1i}$ as $c_{1i} = c_{1i} \times 2^{-l_i} \times 2^{l_i}$. Instead of the original value of $c_{1i}$, we store the values of $c_{1i} \times 2^{-l_i}$ and $l_i$ in the coefficients table. In this case, the product $c_{1i}(x - s_i)$ is computed using the multiplier for $c_{1i} \times 2^{-l_i} \times (x - s_i)$ and the shifter for an $l_i$-bit shift to the left. Increasing $l_i$ reduces the number of bits needed to represent the value of $c_{1i} \times 2^{-l_i}$, while increasing the rounding error. Our synthesis system finds the optimum value of $l_i$ for each segment $seg_i$ within the acceptable error [13]. When the optimum values of $l_i$ are 0 for all the segments $seg_i$, no shifter is implemented; that is, $c_{1i}(x - s_i)$ is directly implemented with the multiplier.

Next, we consider the case where the range of $c_{1i}$ includes negative values. In this case, our synthesis system stores the absolute value of $c_{1i}$ and the sign bit for $c_{1i}$ separately in the coefficients table, and first uses the unsigned multiplier to compute $|c_{1i}|(x - s_i)$, and then a two's complementer to produce the signed value with the sign bit. When $c_{1i}$ is positive for all segments $seg_i$, no two's complementer is implemented. That is, $c_{1i}(x - s_i)$ is directly implemented with an unsigned multiplier.

For simplicity, Fig. 3 omits the schemes for the scaling method and the two's complementer.
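The scaling method and the sign-magnitude storage can be modeled with integer arithmetic as in the Python sketch below. It is only an illustration under stated assumptions: the table entry layout `(sign, magnitude, shift)`, the parameter `frac_bits`, and the function names are hypothetical, and the shift amount is supplied by the caller rather than optimized as in [13].

```python
from typing import Tuple


def encode_slope(c1: float, shift: int, frac_bits: int) -> Tuple[int, int, int]:
    """Return a hypothetical table entry (sign, magnitude, shift) for slope c1.

    magnitude is |c1| * 2**-shift rounded to `frac_bits` fractional bits.
    """
    sign = 1 if c1 < 0 else 0
    magnitude = round(abs(c1) * 2.0 ** (frac_bits - shift))
    return sign, magnitude, shift


def scaled_product(entry: Tuple[int, int, int], d: int) -> int:
    """Compute c1 * d from a table entry, scaled by 2**frac_bits.

    d is the unsigned offset x - s_i as a raw integer.
    """
    sign, magnitude, shift = entry
    product = magnitude * d                 # unsigned multiplier
    product <<= shift                       # shifter: l_i-bit shift to the left
    return -product if sign else product    # two's complementer
```

With `c1 = -40.0`, `shift = 3`, and `frac_bits = 4`, the stored magnitude is 80 instead of 640, so it needs three fewer bits, and `scaled_product` applied to `d = 3` returns -1920, which equals $-40 \times 3 \times 2^4$. Larger shifts discard low-order bits of the slope, which is the rounding-error trade-off described above.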


5.2.  Pipeline  Processing 

To implement a high-throughput NFG, our synthesis system inserts pipeline registers between all the units in the architecture. Since all the units operate in parallel, and each unit has a short delay time, our NFGs achieve high throughput. Table 1 shows the units and the number of pipeline stages for them. Our NFGs may have from $n\_cas + 4$ to $n\_cas + 6$ pipeline stages, where $n\_cas$ is the number of LUTs for the LUT cascade.
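The stage count of Table 1 and the resulting latency can be written down as a small Python sketch. It assumes that each pipeline stage takes one clock cycle, so that latency is the number of stages times the clock period; the function names and the example value `n_cas = 4` are hypothetical.

```python
def pipeline_stages(n_cas: int, use_shifter: bool = False,
                    use_complementer: bool = False) -> int:
    """Total number of pipeline stages according to Table 1."""
    fixed = 4  # coefficients table, adder for x + (-s_i), multiplier, final adder
    return n_cas + fixed + int(use_shifter) + int(use_complementer)


def latency_ns(stages: int, freq_mhz: float) -> float:
    """Latency in nanoseconds, assuming one clock cycle per stage."""
    return stages * 1000.0 / freq_mhz
```

For example, `pipeline_stages(4)` gives 8 stages, and `latency_ns(8, 185.0)` gives about 43.2 ns. The throughput is unaffected by the number of stages, since a new input can enter the pipeline every clock cycle.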


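Table 3. Numbers of segments for non-uniform and uniform segmentations.

Acceptable approximation error: $2^{-17}$

Function $f(x)$     | Domain      | Number of segments
                    |             | Non-uniform | Uniform
--------------------+-------------+-------------+--------
$\sin(\pi x)$       | [0, 1/2]    | 127         | 257
$\cos(\pi x)$       | [0, 1/2]    | 127         | 257
$\tan(\pi x)$       | [0, 1/4]    | 112         | 257
$1/x$               | [1/8, 1]    | 702         | 3585
$1/\sqrt{x}$        | [1/32, 1]   | 620         | 7937
$\sqrt{x}$          | [0, 1]      | 231         | 32769
$\sqrt{-\ln(x)}$    | (0, 1]      | 584         | 32768
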

6.  EXPERIMENTAL  RESULTS 


6.1.  Computation  Time  for  Segmentation  Algorithm 

Table 2 shows the CPU time for the segmentation algorithm applied to 12 of the 14 functions used in [12] with various acceptable approximation errors. In this table, the Sigmoid and the Gaussian are defined as follows:

$$\mathrm{Sigmoid} = \frac{1}{1 + e^{-4x}}, \qquad \mathrm{Gaussian} = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} .$$

The segmentation algorithm is recursive, and computation time depends on the number of segments. A smaller acceptable approximation error requires more segments and longer computation time. However, Table 2 shows that for all the functions in the table, the CPU times were smaller than 2 seconds when the acceptable approximation error was $2^{-25}$. These results show that our segmentation algorithm generates non-uniform segments quickly.

6.2.  Comparison  of  Three  Architectures 

This section compares the three architectures for NFGs shown in Fig. 3. Let Arc_A, Arc_B, and Arc_C denote the architectures shown in Fig. 3 (a), (b), and (c), respectively. To compare these architectures for various functions, we implemented 16-bit precision NFGs on the same FPGA (Altera Stratix EP1S10F484C5), using an acceptable approximation error of $2^{-17}$ for each function.

Table 3 compares the numbers of segments for the non-uniform and the uniform segmentations. It shows that, for all the functions, the number of non-uniform segments is less than half the number of uniform segments. Since Arc_A and Arc_B use non-uniform segmentation, they implement various numerical functions with small coefficients tables. On the other hand, Arc_C uses uniform segmentation. Thus, although Arc_C implements functions, such as trigonometric functions, with a relatively small coefficients table, it requires a large coefficients table for other functions. In this experiment, there were not enough memory blocks in the FPGA (EP1S10F484C5) to implement the non-trigonometric functions using Arc_C.

Tables 4 and 5 compare the amount of hardware and the performance of the three architectures. These tables show that, for trigonometric functions, Arc_C yields the most compact NFGs with the shortest latency among the three, since Arc_C requires no segment index encoder. Therefore, when the number of uniform segments is relatively small, Arc_C is
Table 2. CPU time [msec] for the segmentation algorithm.

Function $f(x)$    | Domain       | AAE = $2^{-9}$     | AAE = $2^{-17}$    | AAE = $2^{-25}$
                   |              | #seg. | CPU time   | #seg. | CPU time   | #seg. | CPU time
-------------------+--------------+-------+------------+-------+------------+-------+---------
$2^x$              | [0, 1]       | 8     | 0.1        | 128   | 0.1        | 2048  | 80
$1/x$              | [1/8, 1]     | 39    | 0.1        | 702   | 30         | 11218 | 280
$1/\sqrt{x}$       | [1/32, 1]    | 31    | 0.1        | 620   | 20         | 9946  | 300
$\sqrt{x}$         | [0, 1]       | 12    | 0.1        | 231   | 10         | 3941  | 110
$\sqrt{-\ln(x)}$   | (0, 1]       | 23    | 0.1        | 584   | 40         | 12089 | 1840
$\log_2(x)$        | [1, 2]       | 8     | 0.1        | 128   | 10         | 2048  | 70
$\ln(x)$           | (illegible)  | 6     | 0.1        | 89    | 0.1        | 1437  | 30
$\sin(\pi x)$      | [0, 1/2]     | 8     | 0.1        | 127   | 10         | 2027  | 50
$\cos(\pi x)$      | [0, 1/2]     | 8     | 0.1        | 127   | 10         | 2027  | 50
$\tan(\pi x)$      | [0, 1/4]     | 7     | 0.1        | 112   | 10         | 1787  | 50
Sigmoid            | [0, 1]       | 8     | 0.1        | 127   | 10         | 2020  | 60
Gaussian           | [0, 1/2]     | 2     | 0.1        | 32    | 10         | 512   | 10

AAE: Acceptable Approximation Error. #seg.: Number of segments.

Experiment environment
CPU: Pentium4 Xeon 2.8GHz, memory: 4GB
OS: Redhat (Linux 7.3), C compiler: gcc -O2



Table 4. Amount of hardware for three architectures.

Precision, Accuracy: 16-bit precision, 15-bit accuracy
FPGA device: Altera Stratix (EP1S10F484C5)
Logic synthesis tool: Altera Quartus II 4.1 (default option)

Function $f(x)$   | Arc_A                         | Arc_B                    | Arc_C
                  | #LEs | Memory      | #DSPs    | #LEs | Memory | #DSPs    | #LEs | Memory  | #DSPs
------------------+------+-------------+----------+------+--------+----------+------+---------+------
$\sin(\pi x)$     | 106  | 19355       | 8        | 107  | 20061  | 2        | 82   | 14848   | 2
$\cos(\pi x)$     | 136  | 19543       | 8        | 116  | 20169  | 2        | 67   | 15417   | 2
$\tan(\pi x)$     | 106  | 19355       | 8        | 116  | 20039  | 2        | 83   | 29696   | 2
$1/x$             | 153  | (illegible) | 8        | 172  | 172119 | 2        | 112  | 278594  | 2
$1/\sqrt{x}$      | 182  | 159826      | 8        | 183  | 160861 | 2        | 145  | 557119  | 2
$\sqrt{x}$        | 191  | (illegible) | 2        | 175  | 44359  | 2        | 195  | 1048576 | 0
$\sqrt{-\ln(x)}$  | 226  | 164944      | 8        | 230  | 164957 | 2        | 206  | 1114112 | 0

The domains of functions $f(x)$ are the same as Table 3.
#LEs: Number of logic elements. Memory: Memory size [bit].
#DSPs: Number of 9-bit x 9-bit DSP units.



smaller and faster than Arc_A and Arc_B. However, Arc_C cannot implement the square root or reciprocal functions using the FPGA due to the excessive size of the coefficients tables. In Table 5, for $\sqrt{x}$ and $\sqrt{-\ln(x)}$, Arc_C used large single look-up tables. From these results, we can see that Arc_C is suitable only for trigonometric functions, and is unsuitable for square root, reciprocal, etc.


Arc_B implements various functions with fewer DSP units than Arc_A, because Arc_B requires a smaller multiplier than Arc_A. Note that the FPGA synthesis system uses more DSP units for a multiplier with more bits. Thus, Arc_B offers a fast and compact implementation. In Arc_A, for all the functions except for $\sqrt{x}$, the multiplier has the longest delay time among all units. On the other hand, in Arc_B, for all the functions, the multiplier is not the slowest unit, and the coefficients table or the adder has the longest delay time among all units. For $\sqrt{x}$, Arc_A, which has a smaller coefficients table, was faster than Arc_B. From these results, we can conclude: 1) To implement a fast NFG with an FPGA, reducing the multiplier size is important. 2) Arc_B is the most efficient for various numerical functions among the three architectures.


6.3.  Comparison  with  an  Existing  Method 

To show the efficiency of our automatic synthesis system, we compare our NFGs with the ones reported in [8]. The NFGs in [8] are also based on non-uniform segmentation, but they are designed by hand. We generated the NFGs with the same precision as [8]. Table 6 shows that our NFGs have performance comparable to [8]. Our system also generated 24-bit precision NFGs with an operating frequency of more than 125 MHz for some functions in [12]. Due to the page limitation, the results are omitted.


7.  CONCLUSION  AND  COMMENTS 

We have proposed an architecture and a synthesis method for programmable NFGs for trigonometric functions, logarithm functions, square root, reciprocal, etc. Our architecture using an LUT cascade compactly realizes various numerical functions, and is suitable for automatic synthesis. Experimental results show that: 1) Our architecture efficiently implements NFGs for a wide range of numerical functions; and 2) Our synthesis system generates NFGs with performance comparable to those designed by hand.

Currently, we are working on NFGs using a quadratic approximation algorithm to reduce the memory size.
Table 5. Comparison of performances for three architectures.

Precision, Accuracy: 16-bit precision, 15-bit accuracy
FPGA device: Altera Stratix (EP1S10F484C5)
Logic synthesis tool: Altera Quartus II 4.1 (default option)

Function $f(x)$   | Arc_A                      | Arc_B                      | Arc_C
                  | Freq. | #stages | Latency  | Freq. | #stages | Latency  | Freq. | #stages | Latency
------------------+-------+---------+----------+-------+---------+----------+-------+---------+--------
$\sin(\pi x)$     | 124   | 7       | 56       | 185   | 8       | 43       | 188   | 3       | 16
$\cos(\pi x)$     | 126   | 8       | 64       | 187   | 9       | 48       | 184   | 4       | 22
$\tan(\pi x)$     | 125   | 7       | 56       | 190   | 9       | 47       | 183   | 3       | 16
$1/x$             | 125   | 8       | 64       | 179   | 9       | 50       | -     | 4       | -
$1/\sqrt{x}$      | 124   | 9       | 73       | 178   | 10      | 56       | -     | 4       | -
$\sqrt{x}$        | 182   | 8       | 44       | 179   | 9       | 50       | -     | 1       | -
$\sqrt{-\ln(x)}$  | 125   | 9       | 72       | 176   | 10      | 57       | -     | 1       | -

The domains of functions $f(x)$ are the same as Table 3.
-: The function could not be implemented.
Freq.: Operating frequency [MHz]. #stages: Number of pipeline stages.
Latency: [nsec].



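Table 6. Performance comparison with existing method.

FPGA device: Xilinx Virtex-II (XC2V4000-6)
Logic synthesis tool: Xilinx ISE 6.3 (default option)

Function $f(x)$   | Domain      | In prec.           | Out prec.  | Our method                       | Method in [8]
                  |             | Int         | Frac | Int | Frac | Freq.       | #stages | Latency  | Freq. | #stages | Latency
------------------+-------------+-------------+------+-----+------+-------------+---------+----------+-------+---------+--------
$\sqrt{-\ln(x)}$  | (illegible) | (illegible) | 32   | 3   | 5    | (illegible) | 20      | 163      | 133   | 14      | 105
$\sin(2\pi x)$    | (illegible) | 0           | 16   | 1   | 8    | (illegible) | 10      | 65       | 133   | 14      | 105
$\cos(2\pi x)$    | (illegible) | 0           | 16   | 1   | 8    | (illegible) | 11      | 67       | 133   | 14      | 105

In prec.: Precision of input. Out prec.: Precision of output.
Int: $n\_int$. Frac: $n\_frac$.
Freq.: Operating frequency [MHz]. #stages: Number of pipeline stages. Latency: [nsec].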